Speech encoding/decoding apparatus having selected encoders

ABSTRACT

Several encoders perform a local decoding of a speech signal and extract excitation information and vocal tract information from the speech signal for an encoding operation. The transmission rate ratio between the excitation information and the vocal tract information is different for each encoder. An evaluation/selection unit evaluates the quality of the decoded signals subjected to local decoding in each of the encoders, determines the most suitable encoder from among the several encoders based on the result of the evaluation, and selects that encoder, thereby outputting the selection result as selection information. The decoder decodes a speech signal based on the selection information, the vocal tract information and the excitation information. The evaluation/selection unit selects the output from the encoder in which the quality of the locally decoded signal is the most preferable. When the vocal tract information changes little, the vocal tract information is not output, thereby causing a surplus of transmission capacity. As much of this surplus of unused vocal tract information as possible is assigned to a residual signal. Thus, the quality of the decoded speech signal is improved.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech encoding and decoding apparatus for transmitting a speech signal after information compression processing has been applied.

Recently, a speech encoding and decoding apparatus for compressing speech information to data of about 4 to 16 kbps at a high efficiency has been demanded for in-house communication systems, digital mobile radio systems and speech storing systems.

2. Description of Related Art

As the first prior art structure of a speech prediction encoding apparatus, there is provided an adaptive prediction encoding apparatus for multiplexing the prediction parameters (vocal tract information) of a predictor and a residual signal (excitation information) for transmission to the receiving station.

FIG. 1 is a block diagram of an encoder used in the speech encoding apparatus of the first prior art structure. Encoder 100 comprises linear prediction analysis unit 101, predictor 102, quantizer 103, multiplexing unit 104 and adders 105 and 106.

Linear prediction analysis unit 101 analyzes input speech signals and outputs prediction parameters, and predictor 102 predicts input signals using an output from adder 106 (described below) and prediction parameters from linear prediction analysis unit 101. Adder 105 outputs error data by computing the difference between an input speech signal and the predicted signal, quantizer 103 obtains a residual signal by quantizing the error data, and adder 106 adds the output from predictor 102 to that of quantizer 103, thereby enabling the output to be fed back to predictor 102. Multiplexing unit 104 multiplexes prediction parameters from linear prediction analysis unit 101 and a residual signal from quantizer 103 for transmission to a receiving station.

With such a structure, linear prediction analysis unit 101 performs a linear prediction analysis of an input signal at every predetermined frame period, thereby extracting prediction parameters as vocal tract information to which appropriate bits are assigned by an encoder (not shown). The prediction parameters are thus encoded and output to predictor 102 and multiplexing unit 104. Predictor 102 predicts an input signal based on the prediction parameters and an output from adder 106. Adder 105 computes the error data (the difference between the predicted information and the input signal), and quantizer 103 quantizes the error data, thereby assigning appropriate bits to the error data to provide a residual signal. This residual signal is output to multiplexing unit 104 as excitation information.

After that, the encoded prediction parameters and residual signal are multiplexed by multiplexing unit 104 and transmitted to a receiving station.

Adder 106 adds an input signal predicted by predictor 102 and a residual signal quantized by quantizer 103. The addition output is again input to predictor 102 and is used to predict the input signal together with the prediction parameters.
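The encoding loop described above (predictor 102, quantizer 103, adders 105 and 106) can be illustrated by the following sketch. It is only an illustration of the principle, not the actual implementation: the function name, the uniform quantizer and its step-size parameter are assumptions made for the example.

```python
import numpy as np

def adaptive_prediction_encode(frame, lpc_coeffs, history, step=0.05):
    """Illustrative encoder loop of FIG. 1 (assumed names and quantizer).

    frame      -- input speech samples of one frame
    lpc_coeffs -- prediction parameters a_1..a_p from linear prediction analysis unit 101
    history    -- last p locally decoded samples (state of the feedback loop)
    step       -- quantizer step size, standing in for the bit assignment of quantizer 103
    """
    p = len(lpc_coeffs)
    decoded = list(history)              # locally decoded signal fed back to predictor 102
    residual = []
    for x in frame:
        # predictor 102: short-term prediction from locally decoded past samples
        predicted = float(np.dot(lpc_coeffs, decoded[-p:][::-1]))
        # adder 105: prediction error between the input sample and the predicted sample
        error = x - predicted
        # quantizer 103: coarse quantization of the error gives the residual signal
        q = step * round(error / step)
        residual.append(q)
        # adder 106: local decoding; the sum is fed back to predictor 102
        decoded.append(predicted + q)
    return residual, decoded[-p:]
```

The prediction parameters and the residual samples correspond to the two streams that multiplexing unit 104 combines for transmission.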

In this case, the number of bits assigned to prediction parameters for each frame is fixed at α bits per frame and the number of bits assigned to the residual signal is fixed at β bits per frame. Therefore, the (α+β) bits for each frame are transmitted to the receiving station. In this case, the transmission rate is, for example, 8 kbps.
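As a rough numerical illustration (the frame length is not specified above and is assumed here to be 20 ms), an 8 kbps channel gives a fixed per-frame budget of

    (α + β) bits/frame = 8000 bits/s × 0.02 s/frame = 160 bits/frame,

which is divided between the prediction parameters (α bits) and the residual signal (β bits) in a fixed, unchanging ratio.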

FIG. 2 is a block diagram showing a second prior art structure of the speech encoding apparatus. This prior art structure is a Code Excited Linear Prediction (CELP) encoder, which is known as a low bit rate speech encoder.

Principally, a CELP encoder, like the first prior art structure shown in FIG. 1, is an apparatus for encoding and transmitting linear prediction code parameters (LPC or prediction parameters) obtained from an LPC analysis and a residual signal. However, this CELP encoder represents a residual signal by using one of the residual patterns within a code book, thereby obtaining high efficiency encoding.

Details of CELP are disclosed in Atal, B. S., and Schroeder, M. R., "Stochastic Coding of Speech at Very Low Bit Rate", Proc. ICASSP 84, pp. 1610 to 1613, 1984, and a summary of the CELP encoder will be explained as follows by referring to FIG. 2.

LPC analysis unit 201 performs an LPC analysis of an input signal, and quantizer 202 quantizes the analyzed LPC parameters to be supplied to predictor 203. Pitch period m, pitch coefficient Cp and gain G, which are not shown, are extracted from the input signal.

A residual waveform pattern (code vector) is sequentially read out from the code book 204 and each pattern is, at first, input to multiplier 205 and multiplied by gain G. Then, the output is input to a feed-back loop, namely, a long-term predictor comprising delay circuit 206, multiplier 207 and adder 208, to synthesize a residual signal. The delay value of delay circuit 206 is set at the same value as the pitch period. Multiplier 207 multiplies the output from delay circuit 206 by pitch coefficient Cp.

A synthesized residual signal output from adder 208 is input to a feed-back loop, namely, a short-term prediction unit comprising predictor 203 and adder 209, and the predicted input signal is synthesized. The prediction parameters are the LPC parameters from quantizer 202. The predicted input signal is subtracted from the input signal at subtracter 210 to provide an error signal. Weight function unit 211 applies a weight to the error signal, taking into consideration the acoustic characteristics of human hearing. This is a correcting process that makes the error perceived by the human ear uniform, because the influence of the error on the ear differs depending on the frequency band.

The output of weight function unit 211 is input to error power evaluation unit 212 and the error power is evaluated in respective frames.

A white noise code book 204 has a plurality of samples of residual waveform patterns (code vectors), and the above series of processes is repeated with regard to all the samples. The residual waveform pattern whose error power within a frame is minimum is selected as the residual waveform pattern of the frame.

As described above, the index of the residual waveform pattern obtained for every frame, as well as the LPC parameters from quantizer 202, pitch period m, pitch coefficient Cp and gain G, are transmitted to a receiving station (not shown). The receiving station forms a long-term predictor with the transmitted pitch period m and pitch coefficient Cp, as in the above case, and the residual waveform pattern corresponding to the transmitted index is input to the long-term predictor, thereby reproducing a residual signal. Further, the transmitted LPC parameters form a short-term predictor, as in the above case, and the reproduced residual signal is input to the short-term predictor, thereby reproducing the input signal.
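The analysis-by-synthesis search described above can be summarized by the following sketch. It is a simplified illustration of the CELP search of FIG. 2, not the actual implementation; the function signature, the callable `weight_filter` standing in for weight function unit 211, and the direct-form filter loops are assumptions made for the example.

```python
import numpy as np

def celp_codebook_search(frame, codebook, gain, pitch_period, pitch_coeff,
                         lpc_coeffs, weight_filter):
    """Pick the code vector of code book 204 with the minimum weighted error power."""
    frame = np.asarray(frame, dtype=float)
    p = len(lpc_coeffs)
    best_index, best_power = 0, np.inf
    for index, code_vector in enumerate(codebook):
        # multiplier 205: scale the code vector (residual waveform pattern) by gain G
        excitation = gain * np.asarray(code_vector, dtype=float)
        # long-term predictor (206, 207, 208): add the pitch-delayed, scaled signal
        residual = excitation.copy()
        for n in range(len(residual)):
            if n - pitch_period >= 0:
                residual[n] += pitch_coeff * residual[n - pitch_period]
        # short-term predictor (203, 209): LPC synthesis filter driven by the residual
        synthesized = np.zeros(len(residual))
        for n in range(len(residual)):
            past = synthesized[max(0, n - p):n][::-1]
            synthesized[n] = residual[n] + float(np.dot(lpc_coeffs[:len(past)], past))
        # subtracter 210 and weight function unit 211: perceptually weighted error
        weighted_error = weight_filter(frame - synthesized)
        # error power evaluation unit 212: keep the code vector with minimum error power
        power = float(np.sum(weighted_error ** 2))
        if power < best_power:
            best_index, best_power = index, power
    return best_index
```

The returned index, rather than the residual waveform itself, is what is transmitted to the receiving station together with the LPC parameters, pitch period m, pitch coefficient Cp and gain G.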

The respective dynamic characteristics of the excitation unit and the vocal tract unit in the sound producing structure of a human are different, and the respective quantities of data to be transmitted at arbitrary points by the excitation unit and the vocal tract unit are therefore also different.

However, with a conventional speech encoding apparatus as shown in FIG. 1 or 2, excitation information and vocal tract information are transmitted at a fixed ratio of data quantity, and the above speech characteristics are not utilized. Therefore, when the transmission rate is low, quantization becomes coarse, thereby increasing noise and making it difficult to maintain satisfactory speech quality.

The above problem is explained as follows with regard to the conventional examples shown in FIGS. 1 and 2.

In a speech signal there exists a period in which the characteristics change abruptly and a period in which the state is steady, and in the latter the values of the prediction parameters do not change very much. Namely, there are cases where the correlation between the prediction parameters (LPC parameters) of consecutive frames is strong, and cases where it is not strong. Conventionally, prediction parameters (LPC parameters) are transmitted at a constant rate for every frame. Consequently, the characteristics of the speech signal are not fully utilized. Therefore, the transmitted data contains redundancies and the quality of the reproduced speech in the receiving station is not sufficient for the amount of transmission data.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a mode-switching-type speech encoding/decoding apparatus for providing a plurality of modes which differ in the transmission ratio between excitation information and vocal tract information, and, upon encoding, switching to the mode in which the best reproduced speech quality can be obtained.

Another object of the present invention is to suppress the redundancy of the transmitted information by not transmitting relatively stable vocal tract information and instead assigning more bits to the excitation information, which is useful for improving quality, thereby increasing the quality of the reproduced speech. In order to achieve the above objects, the present invention has adopted the following structure.

The present invention relates to a speech encoding apparatus for encoding a speech signal by separating the characteristics of said speech signal into articulation information (generally called vocal tract information) representing articulation characteristics of said speech signal, and excitation information representing excitation characteristics of said speech signal. Articulation characteristics are frequency characteristics of a voice formed by the human vocal tract and nasal cavity, and sometimes refer only to vocal tract characteristics. Vocal tract information representing the vocal tract characteristics comprises LPC parameters obtained by performing a linear prediction analysis of a speech signal. Excitation information comprises, for example, a residual signal. The present invention is also based on a speech decoding apparatus. The present invention based on the above speech encoding/decoding apparatus has the structure shown in FIG. 3.

A plurality of encoding units (or "ENCODERS #1 to #m") 301-1 to 301-m encode speech signal (or "INPUT") 303 by extracting vocal tract information (or "VOCAL TRACT PARAMETERS") 304 and excitation information (or "EXCITATION PARAMETERS") 305 from the speech signal 303, and by performing a local decoding on it. The vocal tract information and excitation information are generally in the form of parameters. The transmission ratios of the respective encoded information are different, as shown by the reference numbers 306-1 to 306-m in FIG. 3. The above encoding units comprise a first encoding unit for encoding a speech signal by locally decoding it and extracting LPC parameters and a residual signal from it at every frame, and a second encoding unit for encoding a speech signal by performing a local decoding on it and extracting a residual signal from it using the LPC parameters of a frame several frames before the current one, the LPC parameters being obtained by the first encoding unit.

Next, evaluation/selection units (or "EVALUATION AND DECISION OF OPTIMUM ENCODER") 302-1/302-2 evaluate the quality of the respective decoded signals 307-1 to 307-m subjected to local decoding by respective encoding units 301-1 to 301-m, thereby providing an evaluation result. Then they decide and select the most appropriate encoding unit from among the encoding units 301-1 to 301-m, based on the evaluation result, and output a result of the selection (or "SELECT") as selection information 310. The evaluation/selection units comprise evaluation decision unit 302-1 and selection unit 302-2, respectively, as shown in FIG. 3.

The speech encoding apparatus of the above structure outputs vocal tract information 304 and excitation information 305 encoded by the encoding unit selected by evaluation/selection units 302-1/302-2, and outputs selection information 310 from evaluation/selection units 302-1/302-2, to, for example, line 308.

Decoding unit (or "DECODER #") 309 decodes speech signal 311 from selection information 310, vocal tract information 304 and excitation information 305, which are transmitted from the speech encoding apparatus.

With such a structure, evaluation/selection units 302-1/302-2 select the encoded outputs 304 and 305 of the encoding unit whose locally decoded signal, among decoded signals 307-1 to 307-m, is evaluated to be of the best quality.

In the portions of the speech signal in which vocal tract information does not change, the LPC parameters are not output, thereby causing a surplus of information. As much of the surplus as possible is assigned to a residual signal, thereby improving the quality of decoded signal (or "OUTPUT") 311 obtained in the speech decoding apparatus.

In the block diagram shown in FIG. 3, the speech encoding apparatus is combined with the speech decoding apparatus through a line 308, but it is clear that only the speech encoding apparatus or only the speech decoding apparatus may be used at one time. In that case, for example, the output from the speech encoding apparatus is stored in a memory, and the input to the speech decoding apparatus is obtained from the memory.

Vocal tract information is not limited to LPC parameters based on linear prediction analysis, but may be cepstrum parameters based, for example, on cepstrum analysis. A method of encoding the residual signal by dividing it into pitch information and noise information by a CELP encoding method or a RELP (Residual Excited Linear Prediction) method, for example, may be employed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a block diagram of a first prior art structure,

FIG. 2 shows a block diagram of a second prior art structure,

FIG. 3 depicts a block diagram for explaining the principle of the present invention,

FIG. 4 shows a block diagram of the first embodiment of the present invention,

FIG. 5 represents a block diagram of the second embodiment of the present invention,

FIG. 6 depicts an operation flow chart of the second embodiment,

FIG. 7A shows a table of an assignment of bits to be transmitted in the second prior art, and

FIG. 7B is a table of an assignment of bits to be transmitted in the second embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The embodiments of the present invention will be explained by referring to the drawings.

FIG. 4 shows a structural view of the first embodiment of the present invention, and this embodiment corresponds to the first prior art structure shown in FIG. 1.

The first quantizer 403-1, predictor 404-1, adders 405-1 and 406-1, and LPC analysis unit 402 correspond to the portions designated by 103, 102, 105, 106, and 101, respectively, in FIG. 1, thereby providing an adaptive prediction speech encoder. In this embodiment, a second quantizer 403-2, a second predictor 404-2, and additional adders 405-2 and 406-2 are further provided. The LPC parameters applied to predictor 404-2 are provided by delaying the output from LPC analysis unit 402 in frame delay circuit 411 through terminal A of switch 410. The portions in the upper stage of FIG. 4, which correspond to those in FIG. 1, cause output terminals 408 and 409 to transmit a residual signal and LPC parameters, respectively. This is defined as A-mode. The signal transmitted from output terminal 412 in the lower stage of FIG. 4 is only the residual signal, which is defined as B-mode. Evaluation units 407-1 and 407-2 evaluate the S/N of the encoders of the A- and B-modes. Mode determining (or "MODE DETERMINATION") portion 413 produces a signal A/B for determining which mode (A-mode or B-mode) should be used to transmit the output to an opposite station (i.e. receiving station) (not shown), based on the evaluation. Switch (SW) unit 410 selects the A side when the A-mode is selected in the previous frame. Then, as the LPC parameters of the B-mode for the current frame, the values of the A-mode of the previous frame are used. When the B-mode is selected in the previous frame, the B side is selected and the values of the B-mode in the previous frame, namely, the values of the A-mode in the frame which is several frames before the current frame, are used.

In this circuit structure, the encoders of the A- and B-modes operate in parallel with regard to every frame. The A-mode encoder produces the current frame prediction parameters (LPC parameters) as vocal tract information from output terminal 409, and a residual signal as excitation information through output terminal 408. In this case, the transmission rate of the LPC parameters is α bits/frame and that of the residual signal is β bits/frame, as in the notation used for FIG. 1. The B-mode encoder outputs a residual signal from output terminal 412 by using the LPC parameters of the previous frame or of a frame which is several frames before the current frame. In this case, the transmission rate of the residual signal is (α+β) bits/frame, so the number of bits for the residual signal can be increased by the number of bits that are not being used for the LPC parameters, as the LPC parameters vary little. Input signals to predictors 404-1 and 404-2 are locally decoded outputs from adders 406-1 and 406-2. They are equal to the signals that are decoded in the receiving station. Evaluation units 407-1 and 407-2 compare these locally decoded signals with the input signal from input terminal 401 to evaluate the quality of the decoded speech. The signal-to-quantization-noise ratio (SNR) within a frame, for example, is used for this evaluation, enabling evaluation units 407-1 and 407-2 to output SN(A) and SN(B). Mode determination unit 413 compares these signals, and if SN(A)>SN(B), a signal designating A-mode is output, and if SN(A)<SN(B), a signal designating B-mode is output.
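A minimal sketch of the evaluation and mode determination described above is given below. The SNR formula and the handling of the case SN(A)=SN(B), which the text does not specify, are assumptions made for the example.

```python
import numpy as np

def frame_snr_db(original, decoded):
    """Signal-to-quantization-noise ratio within a frame (evaluation units 407-1/407-2)."""
    original = np.asarray(original, dtype=float)
    noise = original - np.asarray(decoded, dtype=float)
    return 10.0 * np.log10(np.sum(original ** 2) / max(np.sum(noise ** 2), 1e-12))

def determine_mode(input_frame, decoded_a, decoded_b):
    """Mode determination portion 413: compare SN(A) and SN(B) and emit the A/B signal."""
    sn_a = frame_snr_db(input_frame, decoded_a)
    sn_b = frame_snr_db(input_frame, decoded_b)
    return "A" if sn_a >= sn_b else "B"   # tie resolved to A-mode by assumption
```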

A signal designating A-mode or B-mode is transmitted from mode determination unit 413 to a selector (not shown). Signals from output terminals 408, 409, and 412 are input to the selector. When the signal designates A-mode, the encoded residual signal and LPC parameters from output terminals 408 and 409 are selected and output to the opposite station. When the signal designates B-mode, the encoded residual signal from output terminal 412 is selected and output to the opposite station.

Selection of the A- or B-mode is conducted in every frame. The transmission rate is (α+β) bits per frame, as described above, and is not changed in either mode. The data of (α+β) bits per frame is transmitted to the receiving station after one bit per frame, representing the A/B signal that designates whether the data is in A-mode or B-mode, has been added to the data of (α+β) bits per frame.

The data obtained in B-mode is transmitted if B-mode provides better quality. Therefore, the quality of reproduced speech in the present invention is better than in the prior art shown in FIG. 1, and the quality of the reproduced speech in the present invention can never be worse than in the prior art.

FIG. 5 is a structural view of the second embodiment of this invention. This embodiment corresponds to the second prior art structure shown in FIG. 2. In FIG. 5, 501-1 and 501-2 depict encoders. These encoders are both CELP encoders, as shown in FIG. 2. One of them, 501-1, performs linear prediction analysis on every frame by slicing speech into 10 to 30 ms portions, and outputs prediction parameters, a residual waveform pattern, pitch frequency, pitch coefficient, and gain. The other encoder, 501-2, does not perform linear prediction analysis, but outputs only a residual waveform pattern. Therefore, as described later, encoder 501-2 can assign more quantization bits to a residual waveform pattern than encoder 501-1 can.

The operation mode using encoder 501-1 is called A-mode and the operation mode using encoder 501-2 is called B-mode.

In encoder 501-1, linear prediction analysis unit 506 performs the same function as both LPC analysis unit 201 and quantizing unit 202. White noise code book 507-1, gain controller 508-1, and error computing unit 511-1, respectively, correspond to the features designated by the reference numbers 204, 205, and 210 in FIG. 2. Long-term prediction (or "LONG-TERM PREDICTOR") unit 509-1 corresponds to the features designated by the reference numbers 206 to 208 in FIG. 2. It performs an excitation operation by receiving pitch data, as described in conjunction with the second prior art structure. Short-term prediction (or "SHORT-TERM PREDICTOR") unit 510-1 corresponds to the features represented by the reference numbers 203 and 209 in FIG. 2, and functions as a vocal tract by receiving prediction parameters, as described in the second prior art. In addition, error evaluation unit 512-1 corresponds to the features designated by the reference numbers 211 and 212 in FIG. 2, and performs an evaluation of error power as described in conjunction with the second prior art structure. In this case, error evaluation unit 512-1 sequentially designates addresses (phases) in white noise code book 507-1, and performs evaluations of the error power of all the code vectors (residual patterns), as described in the second prior art structure. Then it selects the code vector that has the lowest error power, thereby producing, as the residual signal information, the number of the selected code vector in white noise code book 507-1.

Error evaluation unit 512-1 also outputs a segmental S/N (S/N_(A)) that represents the waveform distortion within a frame.

Encoder 501-1, described in reference to FIG. 2, produces encoded prediction (or "PREDICTION") parameters (LPC parameters) from linear prediction analysis unit 506. It also produces encoded pitch period, pitch coefficient and gain (not shown).

In encoder 501-2, the portions designated by the reference numbers 507-2 to 512-2 are the same as the respective portions designated by reference numbers 507-1 to 512-1 in encoder 501-1. Encoder 501-2 does not have linear prediction analysis unit 506; instead, it has coefficient memory 513. Coefficient memory 513 holds prediction coefficients (prediction parameters) obtained from linear prediction analysis unit 506. Information from coefficient memory 513 is applied to short-term prediction (or "SHORT-TERM PREDICTOR") unit 510-2 as linear prediction parameters.

Coefficient memory 513 is renewed every time the A-mode is produced (every time the output from encoder 501-1 is selected). It is not renewed and maintains its values when the B-mode is produced (when the output from encoder 501-2 is selected). Therefore, the most recent prediction coefficients transmitted to the decoder station (receiving station) are always kept in coefficient memory 513.

Encoder 501-2 does not produce prediction parameters but produces residual signal information, pitch period, pitch coefficients and gain. Therefore, as is described later, more bits can be assigned to the residual signal information, by the number of bits corresponding to the quantity of prediction parameters that are not output.

Quality evaluation/encoder selection unit 502 selects encoder 501-1 or 501-2, whichever has the better speech reproduction quality, based on the result obtained by a local decoding in respective encoders 501-1 and 501-2. Quality evaluation/encoder selection unit 502 also uses waveform distortion and spectral distortion of reproduced speech signals A and B to evaluate the quality of speech reproduced by encoders 501-1 and 501-2. In other words, unit 502 uses the segmental S/N and the LPC cepstrum distance (CD) of the respective frames in parallel to evaluate the quality of reproduced speech.

Therefore, quality evaluation/encoder selection unit 502 is provided with cepstrum distance (or "CD") computing unit 515, operation mode judgement unit 516, and switch 514.

Cepstrum distance computing unit 515 obtains the first LPC cepstrum coefficients from the LPC parameters that correspond to the present frame and that have been obtained from linear prediction analysis unit 506. Cepstrum distance computing unit 515 also obtains the second LPC cepstrum coefficients from the LPC parameters that are obtained from coefficient memory 513 and are currently used in the B-mode. Then it computes the LPC cepstrum distance CD of the current frame from the first and second LPC cepstrum coefficients. It is generally accepted that the LPC cepstrum distance thus obtained clearly expresses the difference (spectral distortion) between the two sets of vocal tract spectral characteristics determined by the respective LPC parameters.
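A sketch of the computation performed by cepstrum distance computing unit 515 is given below. The recursion assumes LPC coefficients for a synthesis filter of the form 1/(1 - Σ a_k z^-k) and one common dB scaling of the cepstrum distance; the actual sign convention, cepstrum order and scaling of the embodiment are not specified in the text and are assumptions here.

```python
import numpy as np

def lpc_to_cepstrum(a, n_cep):
    """LPC parameters -> LPC cepstrum coefficients c_1..c_{n_cep} (standard recursion)."""
    p = len(a)
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def cepstrum_distance_db(a_current, a_memory, n_cep=16):
    """Spectral distortion CD between the current-frame LPC parameters (unit 506)
    and the LPC parameters held in coefficient memory 513 (used in the B-mode)."""
    c1 = lpc_to_cepstrum(a_current, n_cep)
    c2 = lpc_to_cepstrum(a_memory, n_cep)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((c1 - c2) ** 2))
```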

Operation mode judgement unit 516 receives segmental S/N_(A) and S/N_(B) from encoders 501-1 and 501-2, and receives the LPC cepstrum distance (CD) from cepstrum distance computing unit 515, to perform the process shown in the operation flow chart of FIG. 6.

This process will be described later.

Where operation mode judgement unit 516 selects the A-mode (encoder 501-1), switch 514 is switched to the A-mode terminal side. Where operation mode judgement unit 516 selects the B-mode (encoder 501-2), switch 514 is switched to the B-mode terminal side. Every time the A-mode is produced (the output from encoder 501-1 is selected) by a switching operation of switch 514, coefficient memory 513 is renewed. When the B-mode is produced (so that the output from encoder 501-2 is selected), coefficient memory 513 is not renewed and maintains its current values. Multiplexing (or "MUX") unit 504 multiplexes the residual signal information and prediction parameters from encoder 501-1. Selector 517 selects either the multiplexed output from multiplexing unit 504 (comprising the residual signal information and prediction parameters obtained from encoder 501-1) or the residual signal information output from encoder 501-2, based on encoder number information i obtained from operation mode judgement unit 516.

Decoder 518 outputs a reproduced speech signal based on the residual signal information and prediction parameters from encoder 501-1, or on the residual signal information from encoder 501-2. Thus decoder 518 has a structure similar to those of white noise code books 507-1 and 507-2, long-term prediction units 509-1 and 509-2, and short-term prediction units 510-1 and 510-2 in encoders 501-1 and 501-2.

Separation unit (DMUX) 505 separates the multiplexed signals transmitted from encoder 501-1 into residual signal information and prediction parameters.

In FIG. 5, units to the left of transmission path 503 are on the transmitting side and units to the right are on the receiving side.

With the above structure, a speech signal is encoded with regard to prediction parameters and residual signals in encoder 501-1, or with regard to only the residual signals in encoder 501-2. Quality evaluation/encoder selection unit 502 selects the number i of encoder 501-1 or 501-2, whichever has the better speech reproduction quality, based on the segmental S/N information and LPC cepstrum distance information of every frame. In other words, operation mode judgement unit 516 in quality evaluation/encoder selection unit 502 carries out the following process in accordance with the operation flow chart shown in FIG. 6.

Encoder 501-1 or 501-2 is selected by outputting encoder number i; in A-mode, i=1, and in B-mode, i=2. If the segmental S/N of encoder 501-1 is better than that of encoder 501-2 (S/N_(A) > S/N_(B)), the A-mode is selected by inputting encoder number 1 (encoder 501-1) to selector 517 (in FIG. 6, S1→S2).

On the other hand, if the segmental S/N of encoder 501-2 is better than that of encoder 501-1 (S/N_(A) < S/N_(B)), the following judgement is further executed. The LPC cepstrum distance CD from cepstrum distance computing unit 515 is compared with a predetermined threshold value CD_(TH) (S3). When CD is smaller than the threshold value CD_(TH) (the spectral distortion is small), the B-mode is selected by inputting encoder number 2 (encoder 501-2) to selector 517 (S4). When CD is larger than the threshold value CD_(TH) (the spectral distortion is large), the A-mode is selected by inputting encoder number 1 (encoder 501-1) to selector 517 (S3→S2).
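The judgement of FIG. 6 described above reduces to the following small routine. It is a sketch only; the tie cases S/N_(A)=S/N_(B) and CD=CD_(TH) are resolved here by assumption, since the text describes only the strict inequalities.

```python
def select_encoder(sn_a, sn_b, cd, cd_threshold):
    """Operation mode judgement unit 516: return encoder number i (1: A-mode, 2: B-mode)."""
    if sn_a >= sn_b:          # S1 -> S2: A-mode when its time-domain distortion is not worse
        return 1
    if cd < cd_threshold:     # S3 -> S4: spectral distortion small enough, keep B-mode
        return 2
    return 1                  # S3 -> S2: spectral distortion too large, fall back to A-mode
```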

The above operation enables the most appropriate encoder to be selected.

The reason why two evaluation functions are used as described above is as follows. Where the A-mode is selected, linear prediction analysis unit 506 always computes prediction parameters for the current frame. This ensures that the best spectral characteristics are obtained, so the A-mode can be selected merely on the condition that the segmental S/N_(A), which represents the distortion in the time domain, is good. In contrast, where the B-mode is selected, although the segmental S/N_(B) that represents the distortion in the time domain may be good, this is sometimes merely because the quantization gain of the reproduced signal in the B-mode is better. In this case, there is the possibility that the spectral characteristics of the current frame (determined by the prediction parameters obtained from coefficient memory 513) may be greatly shifted from the real spectral characteristics of the current frame (determined by the prediction parameters obtained from linear prediction analysis unit 506). Namely, the prediction parameters obtained from coefficient memory 513 are those corresponding to previous frames, and the prediction parameters of the present frame may be very different from those of the previous frame, even though the distortion in the time domain of the B-mode is less than that of the A-mode. In such a case, the reproduced signal on the decoding side includes a large spectral distortion that is noticeable to the human ear. Therefore, when the B-mode is selected, it is necessary to evaluate the distortion in the frequency domain (spectral distortion based on LPC cepstrum distance CD) in addition to the distortion in the time domain.

When the segmental S/N of encoder 501-2 is better than that of encoder 501-1, and the spectral characteristics of the current frame are not very different from those of the previous frame, the prediction spectrum of the current frame is not very different from that of the previous frame, so only the residual signal information is transmitted from encoder 501-2. In this case, more quantizing bits are assigned to the residual signal, and the quantization quality of the residual signal is increased; a greater number of bits is spent on the residual signal than in the case where both prediction parameters and residual signals are transmitted to the opposite station. The B-mode (encoder 501-2) can be effectively used, for example, when the same sound "aaah" continues to be enunciated over a series of frames.

Coefficient memory 513 of encoder 501-2 is renewed every time the A-mode is selected (every time the output from encoder 501-1 is selected). When the B-mode is selected (the output from encoder 501-2 is selected), coefficient memory 513 is not renewed but maintains the stored values.

After this, based on the selection result by quality evaluation/encoder selection unit 502, selector 517 selects encoder 501-1 or 501-2 (whichever has the best quality of speech reproduction). The output of quality evaluation/encoder selection unit 502 is transmitted to transmission path 503.

Decoder 518 produces the reproduced signal based on the encoded output (residual signal information and prediction parameters from encoder 501-1, or residual signal information alone from encoder 501-2) and encoder number data i, which are sent through transmission path 503.

The information to be transmitted to the receiving side comprises the code number of the residual signal information, the quantized prediction parameters (LPC parameters), and so on, in the A-mode, and comprises the code number of the residual signal information, and so on, in the B-mode. In the B-mode, the LPC parameters are not transmitted, but the total number of bits is the same in both the A-mode and the B-mode. The code number shows which residual waveform pattern (code vector) is selected in white noise code book 507-1 or 507-2. White noise code book 507-1 in encoder 501-1 contains a small number of residual waveform patterns (code vectors), so a small number of bits represents the code number. In contrast, white noise code book 507-2 in encoder 501-2 contains a large number of code vectors, and a large number of bits corresponds to the code number. Therefore, in the B-mode, the reproduced signal is likely to be more similar to the input signal.
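The advantage of the larger code book can be made concrete with a simple relation (an illustration only, since the actual code book sizes are not given in the text):

    a code number of b bits addresses 2^b residual waveform patterns
    (e.g. b = 7 → 128 code vectors, b = 9 → 512 code vectors),

so every bit freed by omitting the LPC parameters in the B-mode doubles the number of candidate code vectors that can be searched.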

Where the total transmission bit rate is 4.8 kbps, examples of the assignment of the transmission bits for one frame are shown in FIG. 7A for the second prior art structure shown in FIG. 2 and in FIG. 7B for the second embodiment shown in FIG. 5. FIGS. 7A and 7B clearly show that in the A-mode, the bits assigned to each item of information in the embodiment of FIG. 7B are almost the same as those of the second prior art structure shown in FIG. 7A. However, in the B-mode of the present embodiment shown in FIG. 7B, LPC parameters are not transmitted, so the bits not needed for the LPC parameters can be assigned to the code number and gain information, thereby improving the quality of the reproduced speech.

As explained above, the present embodiment does not transmit prediction parameters for frames in which the prediction parameters of the speech do not change much. The bits that are not needed for the prediction parameters are used to improve the sound quality of the data to be transmitted, by increasing the number of bits assigned to the residual signal, or the number of bits assigned to the code number (which is necessary for increasing the capacity of the driving code table), thereby improving the quality of the reproduced speech signal on the receiving side.

In the present embodiment, in response to the dynamic characteristics of the excitation portion and the vocal tract portion in the sound production mechanism of natural human speech, the transmission ratio of the excitation information to the vocal tract information can be controlled in the encoder. This prevents the S/N ratio from deteriorating even at low transmission rates, and good speech quality is maintained.

It should be noted that both encoders 501-1 and 501-2 may produce residual signal information and prediction parameter information. In this case, the ratios of bits assigned to the residual signal information and prediction parameters are different in the two encoders.

As is clear from the above, more than two encoders may be provided. An encoder that produces residual signal information and prediction parameter information may work alongside encoders that produce only residual signal information. Note, however, that the ratio of bits assigned to residual signal information and prediction parameter information differs depending on the encoder. In order to perform a quality evaluation of the reproduced speech in an encoder, in addition to the case in which both the waveform distortion and the spectral distortion of the reproduced speech signal are used, either one of these two distortions may be used alone.

As described above in detail, the mode switching type speech encoding apparatus of the present invention provides a plurality of modes in regard to the transmission ratio of excitation information to vocal tract information, and performs a switching operation between the modes to obtain the best reproduced speech quality. Thus, the present invention can control the transmission ratio of excitation information to vocal tract information in the encoders, and satisfactory sound quality can be maintained even at a lower transmission rate.

What is claimed is:
 1. A speech encoding apparatus for encoding a speech signal by separating a plurality of characteristics of said speech signal into articulation information representing at least one of a plurality of articulation characteristics of said speech signal, and excitation information representing at least one of a plurality of excitation characteristics of said speech signal, comprising: a plurality of encoding means for encoding the articulation information and the excitation information extracted from said speech signal by performing a local decoding of said speech signal, each of said plurality of encoding means having a different ratio of a transmission rate between the encoded articulation information and the encoded excitation information as compared to a similar ratio of other ones of said plurality of encoding means; and evaluation/selection means for evaluating a quality of each of a plurality of decoded signals based on the encoded articulation information and the encoded excitation information, from respective ones of said plurality of encoding means to provide an evaluation result, and for determining and selecting a most appropriate one of the plurality of encoding means from among said plurality of encoding means, based on the evaluation result, to output a result indicative of the most appropriate one of the plurality of encoding means, as selection information, the encoding means selected by said evaluation/selection means outputting said encoded articulation information and said encoded excitation information, and said evaluation/selection means outputting said selection information.
 2. The speech encoding apparatus according to claim 1, wherein: said articulation information comprises at least one of a plurality of linear prediction coding parameters representing at least one of a plurality of vocal tract characteristics, and said excitation information comprises a residual signal representing at least one of a plurality of excitation characteristics.
 3. A speech encoding apparatus according to claim 1, wherein said evaluation/selection means evaluates the quality of each of the plurality of decoded signals by computing a waveform distortion for each of the plurality of decoded signals, and determines and selects one of said plurality of encoding means corresponding to one of the plurality of decoded signals which has a relatively small waveform distortion compared to other ones of said plurality of decoded signals.
 4. A speech encoding apparatus according to claim 1, wherein said evaluation/selection means evaluates the quality of each of the plurality of decoded signals by computing a spectral distortion for each of the plurality of decoded signals, and decides and selects one of said plurality of encoding means corresponding to one of the plurality of decoded signals which has a relatively small spectral distortion compared to other ones of the plurality of decoded signals.
 5. A speech encoding apparatus according to claim 1, wherein said evaluation/selection means evaluates the quality of each of the plurality of decoded signals by computing a waveform distortion and a spectral distortion for each of the plurality of decoded signals, and determines and selects one of said plurality of encoding means based on said waveform distortion and said spectral distortion.
 6. A speech encoding apparatus for encoding a speech signal by separating a plurality of characteristics of said speech signal into at least one of a plurality of linear prediction coding parameters representing at least one of a plurality of vocal tract characteristics of said speech signal and a residual signal representing at least one of a plurality of excitation characteristics of said speech signal at every predetermined frame, comprising: first encoding means for encoding said speech signal by performing a local decoding of said speech signal to provide a first decoded signal and extracting at least one of a plurality of linear prediction coding parameters and said residual signal from said speech signal at every predetermined frame; second encoding means for encoding said speech signal by performing a local decoding of said speech signal to provide a second decoded signal and extracting said residual signal from said speech signal by using said at least one of a plurality of linear prediction coding parameters of a past frame preceding a present frame, said at least one of a plurality of linear prediction coding parameters being obtained from said first encoding means; and evaluation/selection means for evaluating a quality of said first and second decoded signals, to determine and select an appropriate one of said first and second encoding means, wherein: when said evaluation/selection means selects the first encoding means as the appropriate one of said first and second encoding means, said at least one of a plurality of linear prediction coding parameters and said residual signal encoded by said first encoding means, and selection information from said evaluation/selection means are output, and when said second encoding means is selected by said evaluation/selection means as the appropriate one of said first and second encoding means, said residual signal encoded by said second encoding means and selection information obtained by said evaluation/selection means are output.
 7. A speech encoding apparatus according to claim 6, wherein said evaluation/selection means evaluates the quality of said first and second decoded signals by computing a waveform distortion and a spectral distortion for each of said first and second decoded signals, and said evaluation/selection means determines and selects the first encoding means where the waveform distortion of the first decoded signal is smaller than the waveform distortion of the second decoded signal, and said evaluation/selection means determines and selects said first encoding means where the waveform distortion of the second decoded signal is smaller than the waveform distortion of the first decoded signal and where the spectral distortion of the first decoded signal is smaller than the spectral distortion of the second decoded signal, and said evaluation/selection means determines and selects the second encoding means where the waveform distortion of the second decoded signal is smaller than the waveform distortion of the first decoded signal and where the spectral distortion of the second decoded signal is smaller than the spectral distortion of the first decoded signal.
 8. A speech decoding apparatus for decoding a speech signal, comprising: first decoding means for generating and outputting a first decoded speech signal based on at least one of a first plurality of encoded linear prediction coding parameters and an encoded residual signal of a current frame, when selection information is in a first state; and second decoding means for generating and outputting a second decoded speech signal from at least one of a second plurality of encoded linear prediction coding parameters obtained before the current frame, and the encoded residual signal of the current frame, when selection information is in a second state.
 9. A speech encoder/decoder apparatus for encoding a speech signal by separating a plurality of characteristics of said speech signal into articulation information representing at least one of a plurality of articulation characteristics of said speech signal, which is encoded to provide encoded articulation information, and excitation information representing at least one of a plurality of excitation characteristics of said speech signal, which is encoded to provide encoded excitation information, and for decoding said speech signal based on said encoded articulation information, and on said encoded excitation information, comprising: a plurality of encoding means for encoding the articulation information and the excitation information extracted from said speech signal by performing a local decoding of said speech signal, a transmission ratio of said articulation information to said excitation information in one of said plurality of encoding means being different from a similar transmission ratio in another one of said plurality of encoding means; evaluation/selection means for evaluating quality of each of a plurality of decoded speech signals based on the encoded articulation information and the encoded excitation information, from respective ones of said plurality of encoding means to provide an evaluation result, and for determining and selecting a most appropriate one of the plurality of encoding means from among said plurality of encoding means, based on said evaluation result, to output a result indicative of the most appropriate one of the plurality of encoding means as selection information; and decoding means for decoding said speech signal to generate each of the plurality of decoded speech signals using said selection information from said evaluation/selection means and said articulation information and said excitation information encoded by the most appropriate one of the plurality of encoding means selected by said evaluation/selection means.
 10. A method for adjusting an amount of vocal tract information used in a communication system, comprising the steps of: a) encoding an input signal based on at least one of a plurality of linear prediction coding parameters during a first time period to provide a first encoded signal including a first amount of vocal tract information; b) encoding the input signal based on the at least one of the plurality of linear prediction coding parameters during a second time period to provide a second encoded signal including a second amount of vocal tract information which is different from the first amount of vocal tract information; c) decoding the first encoded signal of said step (a) to provide a first decoded signal; d) comparing the first decoded signal of said step (c) with the input signal to provide a first result signal; e) decoding the second encoded signal of said step (b) to provide a second decoded signal; f) comparing the second decoded signal of said step (e) with the input signal to provide a second result signal; g) comparing the first and second result signals of said steps (d) and (f), respectively, to provide a third result signal; and h) reproducing the input signal for use as an output signal by using at least one of the first and second encoded signals of said steps (a) and (b), respectively, based on the third result signal of said step (g).
 11. A method for selecting between a first encoded signal and a second encoded signal for use in reproducing an input signal, comprising the steps of: a) decoding the first encoded signal to provide a first decoded signal; b) decoding the second encoded signal to provide a second decoded signal; c) comparing the first decoded signal of said step (a) to the input signal to provide a first signal-to-noise ratio; d) comparing the second decoded signal with the input signal to provide a second signal-to-noise ratio; e) determining whether the first signal-to-noise ratio is greater than the second signal-to-noise ratio; f) selecting the first encoded signal to reproduce the input signal if the first signal-to-noise ratio is greater than the second signal-to-noise ratio; g) computing a cepstrum distance based on the second encoded signal; h) comparing the cepstrum distance with a predetermined value; i) selecting the first encoded signal to reproduce the input signal if the cepstrum distance is greater than the predetermined value; and j) selecting the second encoded signal to reproduce the input signal when the cepstrum distance is not greater than the predetermined value.
 12. A method for improving quality of an encoded input signal, comprising the steps of: a) encoding an input signal based on at least one of a plurality of modes which each have a transmission ratio between excitation information and vocal tract information which differs from any of the other ones of the plurality of modes, to provide a plurality of encoded signals; b) reproducing the input signal using at least one of the plurality of encoded signals to provide a plurality of reproduced signals; c) comparing the plurality of reproduced signals with the input signal; and d) selecting one of the plurality of encoded signals as the encoded input signal, based on said step (c).